1. Introduction


1.1 Societal Applications and Importance of Speech Recognition 

Speech recognition technology has been influential to society for over a decade. 
The most obvious application of speech recognition is digital voice assistants such as 
Siri (Apple) and Alexa (Amazon). Integrated into many people's lives through 
smartphones and home assistance devices, virtual assistants help out with digital tasks 
like web searches, messages, and calls, as well as smart tasks such as daily reminders 
and health tracking. Advanced speech recognition software is a big factor in the 
convenience of digital voice assistants. 

Other speech recognition systems integrated into different platforms offer
similar convenience. For example, Google and YouTube search both have a "search
by voice" option for those who type slowly or are unable to type due to mental or
physical challenges. Google Docs provides a "voice typing" option that speeds up the
process of writing papers. Google Translate, Duolingo, and other foreign language
learning platforms use speech recognition to analyze and improve students' spoken
language skills.


1.2 Computational Approach to Audio Waves and Audio Processing Tools 
Automatic speech recognition systems have been around since the beginning of
the 21st century. However, it is only in recent years that speech recognition software
has made significant advancements and gained popular traction, thanks to the deep
learning (DL) subset of artificial intelligence (AI). The concept of speech recognition is
simple: a computer converts the spoken audio detected by the microphone into written
text. However, this process is much more complicated in practice.

To begin with, sound is originally in the form of an analog wave picked up by the
microphone, representing the longitudinal waves that move through the air. For
computational use, the analog wave is converted into a series of digital values through a
process called analog-to-digital conversion. The resulting array of binary numbers,
sampled at a sufficiently high rate, gives a relatively accurate representation of the
original sound wave (Anvarjon).
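The conversion described above can be illustrated with a minimal sketch. It assumes NumPy is available, uses a pure sine wave as a stand-in for the analog signal, and assumes 16-bit samples; the function name and the chosen frequency and sample rate are illustrative, not taken from the experiment.

```python
import numpy as np

BIT_DEPTH = 16  # assumed bit depth; 16-bit PCM is common for WAV files

def sample_and_quantize(freq_hz, duration_s, sample_rate_hz):
    """Sample a sine wave at regular intervals and round each sample to an integer."""
    n_samples = round(duration_s * sample_rate_hz)
    t = np.arange(n_samples) / sample_rate_hz       # sampling instants in seconds
    analog = np.sin(2 * np.pi * freq_hz * t)        # stand-in for the analog wave, in [-1, 1]
    max_int = 2 ** (BIT_DEPTH - 1) - 1              # largest positive 16-bit value (32767)
    digital = np.round(analog * max_int).astype(np.int16)
    return t, digital

# Hypothetical example: 10 ms of a 440 Hz tone sampled 8,000 times per second.
t, samples = sample_and_quantize(freq_hz=440, duration_s=0.01, sample_rate_hz=8000)
print(len(samples), samples.dtype)
```

A higher sample rate or bit depth gives a closer approximation of the original wave at the cost of more data.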

There are many ways to represent sound. Most commonly, amplitude-time
graphs are used to display audio files in audio editing software (Figure 0). In these
graphs, the amplitude, which corresponds to the intensity, is scaled and plotted against
time (in seconds). In comparison, spectrograms plot frequency against time and use a
range of colors to display sound intensity at different frequencies (Figure 0.5). Every
representation of audio provides some valuable information, but no single graph can
show every property of a sound wave. Hence, each of these graphs is typically used in a
specific audio processing application ("Understanding").

Figure 0. Amplitude-time graph of a .wav audio file¹

Figure 0.5. Spectrogram with a dBFS range of -100 to -20 (dark purple to light yellow), a
10 kHz maximum frequency, and 100 millisecond time intervals²
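To make the spectrogram representation concrete, the following sketch computes one from a test tone with a short-time Fourier transform. It assumes NumPy; the frame length, hop size, and function name are illustrative choices and are not the settings used to produce the figures or the dataset images.

```python
import numpy as np

def spectrogram_db(signal, sample_rate, frame_len=256, hop=128):
    """Return (frequencies, intensity in dB) with rows = frequency and columns = time."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Cut the signal into overlapping frames and taper each one with the window.
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))    # one spectrum per frame
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    db = 20 * np.log10(magnitude + 1e-10)              # small offset avoids log(0)
    return freqs, db.T

# Hypothetical input: one second of a 1 kHz tone sampled at 8 kHz.
rate = 8000
tone = np.sin(2 * np.pi * 1000 * np.arange(rate) / rate)
freqs, spec = spectrogram_db(tone, rate)
print(spec.shape, freqs[np.argmax(spec[:, 0])])
```

Each column of the result is what a single vertical slice of a spectrogram image encodes as color.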


1 Nathan, Mathura. "Plot Audio File as Time Series Using Scipy Python." GaussianWaves, 2 Aug. 2020,
www.gaussianwaves.com/2020/01/how-to-plot-audio-files-as-time-series-using-scipy-python/.

2 "Understanding Spectrograms." iZotope, 11 Apr. 2019,
www.izotope.com/en/learn/understanding-spectrograms.html.


1.3 The Advantages of Deep Learning for Speech Recognition 

DL methods involving neural networks are preferred over traditional
speech recognition algorithms because of their efficiency and flexibility. Neural
networks can successfully identify spoken syllables and words despite
a range of variation, which may be caused by differences in voice, accent, or simply
mispronunciation. Additionally, thanks to advancements in the Python programming
language and many cloud-based integrated development environments (IDEs), it
has become much easier to code neural networks on less powerful computers,
making the application of DL more convenient.

One question that comes up while planning to develop a speech recognition
system using deep learning is "what type of neural network is best for the task?" Two
of the most commonly used types of neural network for speech recognition are
convolutional and recurrent neural networks. In this extended essay, I will describe both
and experiment with them to answer the question of which is more accurate and
faster for speech recognition: convolutional neural networks or recurrent neural
networks?


2. Theoretical Background 


2.1 Neural Networks 

Neural networks (NNs) are a subcategory of machine learning (ML) that deals with
the tasks of classifying or creating data by building a model that learns from particular
sets of inputs. NNs consist of individual nodes connected in multiple layers, each of
which manipulates the data as it passes from the input layer, through the hidden layers,
and into the output layer (An). At a given layer, the nodes of an NN weight the input data,
sum up all of the values, add a bias, and pass the result through an activation function
and into the next layer (Figure 1). Mathematically, this process can be
expressed using matrix operations. The input matrix is multiplied by the weights
matrix, and the bias matrix is added (Jordan). Each result is passed through an
activation function (e.g. Sigmoid, ReLU, Softmax) to create the output matrix (Equation
1).


Figure 1. Individual neuron of a Feed Forward NN (diagram labels: weights, sum,
activation function, output)³


3 An, Sungtae. "Feedforward Neural Networks." Sungtae's Awesome Homepage, Georgia Institute of
Technology, 8 Oct. 2017, www.cc.gatech.edu/~san37/post/dlhc-fnn/.

Equation 1. The matrix operations that demonstrate the workings of a Feed Forward
layer⁴
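Based on the description in the text (inputs multiplied by weights, a bias added, and the result passed through an activation function), the operation of a single layer can be written as follows. The symbols X, W, b, f, and Y are assumed notation for this sketch and are not taken from the original equation.

```latex
\[
  \mathbf{Y} = f\left(\mathbf{X}\,\mathbf{W} + \mathbf{b}\right)
\]
```

Here X is the input matrix (one row per sample), W is the weights matrix (one column per node), b is the bias added to each node's weighted sum, and f is the activation function applied to every element.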

This is the foundational idea behind a simple "feed forward" neural network
model. The NN "learns" by adjusting the weights throughout its layers until the outputs
closely match the expected correct outputs (Jordan). Each weight
represents the importance or influence of the data that passes through that
point, and can be increased or decreased during training (An). A larger
weight places more significance on the information for that input, and vice versa. Each
node also has a bias, which is a constant added to the weighted input. The bias at each
node in every layer can be adjusted as well to give the neural network more precision
(An). The activation function takes the resultant value (the sum of the
weighted inputs and the bias) and outputs a value determined by the chosen
function (Jordan). For example, a Sigmoid activation function passes any input x
through the function 1/(1 + e^(-x)) so that the output value is between 0 and 1,
whereas a ReLU activation function converts negative inputs to zero but leaves
positive inputs unaltered (Figure 2).
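The forward pass of one layer, together with the two activation functions mentioned above, can be sketched in a few lines. This assumes NumPy; the weights, bias, and input values are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    """Squash any real value into the range 0 to 1."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Replace negative values with zero and keep positive values unchanged."""
    return np.maximum(0.0, z)

def dense_forward(x, weights, bias, activation):
    """Weight the inputs, sum them, add the bias, then apply the activation."""
    return activation(x @ weights + bias)

# Illustrative values: one sample with three inputs feeding a layer of two nodes.
x = np.array([[1.0, -2.0, 0.5]])
weights = np.array([[0.2, 0.5],
                    [0.4, -0.1],
                    [-0.3, 0.8]])
bias = np.array([0.1, -0.1])

relu_out = dense_forward(x, weights, bias, relu)
sigmoid_out = dense_forward(x, weights, bias, sigmoid)
print(relu_out, sigmoid_out)
```

The weighted sums here are -0.65 and 1.0, so ReLU zeroes the first node while Sigmoid maps both into the range between 0 and 1.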


4 Jordan, Jeremy. "Neural Networks: Representation." Jeremy Jordan, 26 Jan. 2018,
www.jeremyjordan.me/intro-to-neural-networks/.

Figure 2. Some of the common activation functions

The neural network passes the error for each run through a loss function, which
generates a loss that is used to change the adjustable variables of the NN's layers
(weights and biases) accordingly. It is important to note that the term "error" is defined
as the deviation of a prediction from the actual value,
whereas "loss" is defined as a quantified measure of how consequential it is to get an
error of a particular size or direction⁶. An example of a common loss function is Mean
Squared Error (MSE), which squares the errors throughout the dataset and outputs the
average, and is most commonly used in Linear Regression models ("Introduction"). The
generated loss is passed to an optimizer algorithm, which adjusts the weights and
biases of the neural network in order to minimize the loss (Nielsen). One of the most
basic and heavily used optimizer algorithms is Gradient Descent (Figure 3) (Raschka). It
uses the first-order derivative of the loss function to calculate how the weights
should be adjusted for the function to reach a minimum (local or global).⁷

5 "Introduction to Loss Functions." Algorithmia Blog, 28 Apr. 2021,
algorithmia.com/blog/introduction-to-loss-functions#types-of-loss-functions.

6 Errors and losses are explained in detail by Michael Nielsen in his book "Neural Networks and Deep
Learning."


Figure 3. A graphical representation of the Gradient Descent optimizer algorithm's
conceptual workings (diagram labels: J(w), gradient)⁸
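Gradient descent on a Mean Squared Error loss can be sketched for the simplest possible model, a line y = w*x + b. This is only an illustration of the idea, assuming NumPy and made-up data; the learning rate and step count are arbitrary choices.

```python
import numpy as np

def mse(pred, target):
    """Mean Squared Error: the average of the squared errors."""
    return float(np.mean((pred - target) ** 2))

def gradient_descent(x, y, learning_rate=0.1, steps=500):
    """Fit y = w*x + b by repeatedly stepping against the gradient of the MSE."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        error = (w * x + b) - y
        grad_w = 2 * np.mean(error * x)   # derivative of the MSE with respect to w
        grad_b = 2 * np.mean(error)       # derivative of the MSE with respect to b
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Made-up data generated from the line y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0
w, b = gradient_descent(x, y)
print(w, b, mse(w * x + b, y))
```

Each step moves the weight and bias a small distance in the direction that lowers the loss, which is the behavior Figure 3 depicts.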


2.2 Convolutional Neural Networks 

Convolutional Neural Networks (CNNs) specialize in picking out
patterns and deriving meaning from them, and thus they are used most often for
image analysis (Saha). A CNN differs from a traditional Multilayer Perceptron NN
because of its convolutional layers. To begin with, most images are stored in a 3D
format: the height times the breadth (dimensions), and the number of color channels
(3 for RGB, 1 for grayscale). In grayscale, each pixel has a value between 0 and 255.
Feeding this information into a standard Feed Forward NN is impractical because
assigning a node to each pixel overwhelms the computer and significantly slows
down the training process. Furthermore, hardwiring each pixel to a node only works for
identifying objects of the exact same scale and at the exact same position in the
image. If the image were scaled by a small factor, or rotated by a small angle, the simple
Feed Forward neural network would likely fail to classify it successfully.

7 General information about NNs and DL can be found at
https://static.latexstudio.net/article/2018/0912/neuralnetworksanddeeplearning.pdf

8 Raschka, Sebastian. "Gradient Descent and Stochastic Gradient Descent." Gradient Descent and
Stochastic Gradient Descent - Mlxtend, Mlxtend, 2014,
rasbt.github.io/mlxtend/user_guide/general_concepts/gradient-optimization/.

In comparison, a convolutional layer uses simple filters (also known as kernels), which are
small grids of values adjusted throughout training, to identify particular
features in the image (like lines and curves). By sliding each filter across the image,
from the top left to the bottom right corner, a CNN is able to capture the spatial and
temporal properties of an image, thus classifying it more successfully, even with small
changes such as scaling and rotating (Saha). The convolution operation works by lining
up the filter with a same-size patch in the image, multiplying the values of the filter
by the corresponding pixel values in the chosen patch of the image, summing the
resulting products, and dividing the sum by the number of pixels
(averaging). The result is stored at the location of the center pixel in a newly created
image of corresponding size to the original, called the feature map, which is passed on
to the next layer (Sewak). The values of the individual filters are essentially the
"weights" of a convolutional layer, and are adjusted through training.
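The sliding-filter operation described above can be sketched directly. This assumes NumPy and, for brevity, uses no padding, so the feature map comes out slightly smaller than the input; the averaging step follows the description in the text, and many libraries omit it. The image and filter values are made up.

```python
import numpy as np

def convolve2d(image, kernel, average=True):
    """Slide the kernel over the image and store one value per position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for row in range(out_h):
        for col in range(out_w):
            patch = image[row:row + kh, col:col + kw]
            total = np.sum(patch * kernel)       # multiply element-wise, then sum
            feature_map[row, col] = total / kernel.size if average else total
    return feature_map

# Illustrative 5x5 image with a vertical edge, and a 3x3 vertical-edge filter.
image = np.array([[0, 0, 0, 1, 1]] * 5, dtype=float)
kernel = np.array([[-1, 0, 1]] * 3, dtype=float)
feature_map = convolve2d(image, kernel)
print(feature_map)
```

The feature map is zero where the patch is uniform and positive where the patch straddles the edge, which is how a filter signals that it has found its feature.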

Nevertheless, multiple convolution operations take a relatively long time to
complete, so to scale down the data, CNNs also use pooling layers, which reduce the
pixel count between convolutional layers (by an amount that depends on the pooling
algorithm). Some common pooling options are Max Pooling (which picks out the maximum
value in a small grid of pixels), Min Pooling (which picks out the minimum), and Average
Pooling (Saha). Finally, following the last convolutional layer, the CNN "flattens" the data by
changing it from a grid (the shape of the feature map) to an array format. The
array is fed into traditional Dense layers, ending in an output layer that classifies the
data (Figure 4).⁹
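Max pooling and flattening can be sketched as follows, assuming NumPy and a made-up 4x4 feature map. The 2x2 grid matches the pooling size used later in the experiment; the function name is illustrative.

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Keep only the maximum value of each non-overlapping size x size block."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size            # drop rows/columns that do not fill a block
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

# Made-up 4x4 feature map.
fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 1, 5, 6],
               [2, 2, 7, 8]])
pooled = max_pool(fm)        # 2x2 grid of block maxima
flat = pooled.flatten()      # grid turned into a 1D array for the Dense layers
print(pooled, flat)
```

Pooling with a 2x2 grid quarters the number of values that the next layer has to process.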


Figure 4. Visualization of a traditional CNN¹⁰


9 The following book explains the types, workings, and applications of CNNs in depth:
https://www.google.com/books/edition/Practical_Convolutional_Neural_Networks/bOIODWAAQBAJ?hl=en&gbpv=0

10 Saha, Sumit. "A Comprehensive Guide to Convolutional Neural Networks - the eli5 Way." Towards Data
Science, 17 Dec. 2018,
towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53.

2.3 Recurrent Neural Networks

Recurrent Neural Networks (RNNs) specialize in modeling sequential data
through the use of memory. Thus, they are often used for audio and natural language
processing (Venkatachalam). An RNN works by looping information from previous steps
back into the network. In a Multilayer Perceptron, each neuron in a layer uses only the input
data to produce an output. However, RNN layers also consider previous data that has
passed through the network before producing an output. In addition to weighting the
input data, the neurons weight the previous data (Figure 5). This small difference allows
RNNs to make sense of sequences much more effectively.
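The looping behavior can be sketched with a bare-bones recurrent layer. This assumes NumPy, random made-up weights, and a tanh activation (a common textbook choice, not necessarily the one used in the experiment); all names are illustrative.

```python
import numpy as np

def simple_rnn(sequence, w_input, w_hidden, bias):
    """Process a sequence step by step, carrying a hidden state (the memory) forward."""
    hidden = np.zeros(w_hidden.shape[0])
    for x_t in sequence:
        # The new state depends on the current input AND on the previous state.
        hidden = np.tanh(x_t @ w_input + hidden @ w_hidden + bias)
    return hidden

rng = np.random.default_rng(0)
sequence = rng.normal(size=(5, 3))               # 5 time steps, 3 features each
w_input = rng.normal(scale=0.5, size=(3, 4))     # weights applied to the input
w_hidden = rng.normal(scale=0.5, size=(4, 4))    # weights applied to the previous state
bias = np.zeros(4)

final_state = simple_rnn(sequence, w_input, w_hidden, bias)
print(final_state)
```

Because the state is carried from step to step, feeding the same values in a different order generally produces a different result, which is what makes the layer sensitive to sequence.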


Figure 5. The structure of an RNN¹¹
There are also multiple types of Recurrent Neural Networks. However, most of
them suffer from short-term memory, due to the vanishing gradient problem (which
causes the earlier weights in the network to barely adjust during training due to the
nature of backpropagation). Because of this, RNNs often fail to effectively use
information from earlier in the sequence to influence the final classification in the output
layer. To combat this, Long Short-Term Memory (LSTM) neural networks, which are an
evolved version of RNNs, are used (Kostadinov). LSTMs expand the scope of the
memory and don't show a positive bias toward later information in a sequence.¹²

11 Venkatachalam, Mahendran. "Recurrent Neural Networks." Towards Data Science, 22 June 2019,
towardsdatascience.com/recurrent-neural-networks-d4642c9bc7ce.


12 Coding RNNs with Python:
https://www.google.com/books/edition/Recurrent_Neural_Networks_with_Python_Qu/cC59DwAAQBAJ?hl=en&gbpv=0

3. Experiment Methodology


3.1 Programming Platform, Language, and Libraries 

For coding convenience and better visual presentation of
processes, I used the Jupyter Notebook IDE in this experiment. Jupyter Notebook is
free, open source, highly interactive, and well structured for programming
NNs. Code is written in cells that can be run separately, but that share state in
the order in which they are executed. For example, one cell can be used to import
libraries, and another cell to define a function. This way, the program is much easier to
troubleshoot and modify. In the case of NNs, training a model usually takes a relatively
long period of time. With the structure of Jupyter Notebook, changes in individual
sections of the code can be made without re-training the NN.

I used the Python programming language for this experiment. Python (v3) is the
most common language in nearly all fields of machine learning, including deep learning.
This is because of its high-level and simple syntax, and the number of libraries that have
been created to aid in the programming of NNs.

The most important of these libraries for this experiment is TensorFlow. It is developed
by Google, released as open source, and used for both research and production.
TensorFlow provides easy ways to define and create multilayer models with various types
of layers, including dense, convolutional, and recurrent layers. Creating, training, and
testing NNs with TensorFlow is much easier than coding an NN from scratch. A full
explanation of how TensorFlow works is beyond the scope of this essay, but essentially a
layer in a TensorFlow model is a number of nodes representing mathematical operations in a
graph, connected by tensors (multidimensional data arrays).

3.2 The Dataset

The dataset for this experiment consists of 3,000 audio samples of the spoken
digits zero through nine. There are 6 different speakers, and each digit is repeated 50
times per speaker. The dataset is freely available on GitHub¹³, and is in
English. The training and testing data are split 90% to 10%, meaning there are 2,700
samples for training and 300 for testing. Each sample is an approximately 1.5 second
WAV audio file that is labeled as follows: {the digit spoken}_{name of the
speaker}_{iteration of the digit}. Additional Python programs are included
with the dataset that trim the silence out of each sample, convert the WAV audio
file into a grayscale JPEG spectrogram image of size 64 pixels by 64 pixels, and
split the data between testing and training. In a separate file, the metadata of
the speakers is provided (including the name, gender, and accent).
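The labeling scheme and the 90% to 10% split can be illustrated with a short sketch. The speaker names are placeholders, and the rule used here for choosing test samples (holding out 5 of the 50 repetitions) is an assumption made for illustration; the dataset's own scripts define the actual split.

```python
from pathlib import Path

def parse_recording_name(filename):
    """Split '{digit}_{speaker}_{iteration}.wav' into its three parts."""
    digit, speaker, index = Path(filename).stem.split("_")
    return int(digit), speaker, int(index)

def is_test_sample(filename, test_indices=range(5)):
    """Assumed rule: 5 of the 50 repetitions per digit and speaker are held out."""
    _, _, index = parse_recording_name(filename)
    return index in test_indices

# Placeholder speaker names; the real names are listed in the dataset's metadata.
speakers = ["speaker%d" % k for k in range(6)]
all_names = ["%d_%s_%d.wav" % (d, s, i)
             for d in range(10) for s in speakers for i in range(50)]
n_test = sum(is_test_sample(name) for name in all_names)
n_train = len(all_names) - n_test
print(parse_recording_name("7_alice_32.wav"), n_train, n_test)
```

With 10 digits, 6 speakers, and 50 repetitions, holding out 5 repetitions reproduces the 2,700 and 300 sample counts given above.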


3.3 CNN Preprocessing and Model Architecture 

I used the 64x64 grayscale spectrogram images of the audio files as input to the
CNN. A Python function 'create_train_data()' defines the two arrays 'training_data' and
'training_labels.' Using a simple 'for' loop, for every image in the training directory, the
function opens the image using the PIL library (Python Imaging Library)
and saves each pixel value of the grayscale image to the 'training_data'
array. Then, another function 'label_img(img)' is called to get the label (spoken digit) of
the image, which is then added to the 'training_labels' array. The function returns the
two arrays. The training data array is reshaped into the shape (2700, 64, 64, 1), with the
2700 representing the number of images, the first 64 representing the number of rows
of pixels, the second 64 representing the number of pixels per row, and the 1
representing the number of color channels (1 for a grayscale value between 0 and
255). This procedure is repeated for the testing portion of the dataset. Finally, each
value in the training and testing data arrays is divided by 255 for the purposes of
normalization. It is much easier for NNs to operate when all weights, variables, and
inputs are within a similar range. To achieve this, every value in the dataset is divided
by the same constant so that each value falls in the range 0 to 1. In this case, 255 is the
maximum value any of the pixels can have, so that is the constant.

13 https://github.com/Jakobovski/free-spoken-digit-dataset
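The reshaping and normalization steps can be sketched as follows. This assumes NumPy and uses random pixel values as a stand-in for the loaded spectrogram images; the function name and the batch of 8 images are illustrative and are not the essay's 'create_train_data()' function.

```python
import numpy as np

def prepare_images(images):
    """Add a channel axis and scale 8-bit pixel values into the range 0 to 1."""
    images = np.asarray(images)
    reshaped = images.reshape(len(images), 64, 64, 1)   # (count, rows, columns, channels)
    return reshaped.astype(np.float32) / 255.0

# Stand-in data: 8 random 64x64 grayscale images (not real spectrograms).
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(8, 64, 64), dtype=np.uint8)
batch = prepare_images(fake_images)
print(batch.shape, batch.min(), batch.max())
```

Applied to the full training set, the same function would produce an array of shape (2700, 64, 64, 1).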

The convolutional neural network is first defined as a sequential model (as are most
other NNs), which requires the 'keras' import from the TensorFlow library. The first
convolutional layer is defined by the number of filters in the layer (64), the input shape
of the image (64x64), the filter size, and the activation function. The filter size in this
case is 3 by 3, which may sound small, but is reasonable given that the input image is also
very small and blurry. A smaller filter may lead to a much longer training period, and a
larger filter may fail to train successfully, even through many iterations. The activation
function used most often in convolutional layers is the rectified linear (ReLU) function.
ReLU converts negative values to zero, which makes sense because
pixels shouldn't have negative values, and it preserves positive
values without squashing them into the range between 0 and 1, as a sigmoid function would.

The convolutional layer is followed by a pooling layer, which has a grid size of 2
by 2 pixels. A pooling grid that is too large will result in the loss of many pixel values
that may be useful for successful classification, but omitting the pooling layer will make
the training significantly slower.

There are three more convolutional layers in the NN, each with 64 filters,
a 3 by 3 filter size, a ReLU activation function, and a pooling layer afterwards (except for
the last one). The last convolutional layer is followed by a flattening layer, which allows
the data to be passed on to a dense layer (of the same size and activation function),
and finally into an output layer of size 10: one neuron for each spoken digit.
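To see how the image shrinks as it passes through this stack of layers, the sizes can be traced with simple arithmetic. The sketch assumes that the convolutions use no padding, which the text does not state; with padding, the convolutional layers would leave the size unchanged and only the pooling layers would shrink it.

```python
def conv_output(size, kernel=3):
    """Side length after a convolution with no padding (an assumption)."""
    return size - kernel + 1

def pool_output(size, grid=2):
    """Side length after pooling with a non-overlapping grid."""
    return size // grid

size = 64                     # input images are 64x64
shapes = []
for layer in range(4):        # four convolutional layers in total
    size = conv_output(size)
    shapes.append(("conv", size))
    if layer < 3:             # every convolutional layer but the last is followed by pooling
        size = pool_output(size)
        shapes.append(("pool", size))

flattened = size * size * 64  # 64 filters in the last convolutional layer
print(shapes, flattened)
```

Under these assumptions the side length goes 64, 62, 31, 29, 14, 12, 6, 4, so the flattening layer hands 1,024 values to the dense layer.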

The "Adam" optimizer is used, which is a stochastic gradient descent method
based on adaptive estimates of the first- and second-order moments of the gradients. The
algorithm itself is complicated, but this optimizer performed much better than the others
that I tested. The "Sparse Categorical Cross Entropy" loss function was used, which
computes the cross-entropy loss between the labels and predictions.¹⁴
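The cross-entropy calculation itself is short enough to sketch by hand. This assumes NumPy and made-up predictions; "sparse" means that each label is a single integer (such as the spoken digit) rather than a one-hot array. The function is an illustration of the formula, not the library implementation.

```python
import numpy as np

def sparse_categorical_crossentropy(labels, probabilities):
    """Average of -log(probability assigned to the correct class)."""
    picked = probabilities[np.arange(len(labels)), labels]
    return float(np.mean(-np.log(picked)))

# Made-up predictions for two samples over three classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])                 # the correct class of each sample
loss = sparse_categorical_crossentropy(labels, probs)
wrong_loss = sparse_categorical_crossentropy(np.array([2, 0]), probs)
print(loss, wrong_loss)
```

The loss is small when the model puts high probability on the correct class and grows quickly when it does not.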


3.4 RNN Preprocessing and Model Architecture 

I chose to use the spectrograms of the audio samples as input to the RNN for a
fairer comparison. Because of this, the preprocessing for the RNN is the same as
that for the CNN. The only difference is that instead of a grid with a color channel, each
image is passed in as a sequence of 64 arrays of 64 values. Creating the model for the
recurrent neural network is also a similar process.
The model is first defined as sequential, then the individual recurrent layers are added.
The arguments passed into the first recurrent layer are as follows: the number of
neurons (128), the input data shape (64 by 64), the activation function (rectified linear),
and a boolean called 'return_sequences.' As mentioned in section 2.3, RNNs are
special because they loop data. The 'return_sequences' argument tells the layer
whether to output its result at every step of the sequence, rather than only at the last
step. If there is another recurrent layer after the current one, that layer needs the full
sequence as input, so this boolean is set to true. This RNN
consists of two recurrent layers, so the argument is set to true for the first layer. The
second recurrent layer also has 128 neurons and a ReLU activation function. Next, there
is a dense layer of 32 neurons with a ReLU activation function, and finally an output
layer with a softmax activation function. The softmax function ensures that
the probabilities in the output layer add up to one, so that the output with
the highest probability can be chosen.

14 The code for my CNN is a modification of the code from this tutorial:
https://colab.research.google.com/drive/1ZZXnCjFEOkp_KdNcNabd14yok0BAluwS
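The softmax step of the output layer can be sketched as follows, assuming NumPy and ten made-up output values (one per digit). Subtracting the maximum before exponentiating is a standard trick for numerical stability and does not change the result.

```python
import numpy as np

def softmax(logits):
    """Turn raw output values into probabilities that add up to one."""
    shifted = logits - np.max(logits)     # avoids overflow in the exponential
    exp = np.exp(shifted)
    return exp / exp.sum()

# Made-up raw outputs of the ten output neurons (digits 0 through 9).
scores = np.array([0.1, 2.0, -1.0, 0.5, 0.0, 3.5, 1.2, -0.3, 0.8, 0.2])
probs = softmax(scores)
predicted_digit = int(np.argmax(probs))
print(probs.sum(), predicted_digit)
```

The predicted digit is simply the position of the largest probability.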

After every hidden layer, there is a "dropout" layer. Dropout is a technique used
to limit overfitting. Though all types of neural networks are prone to overfitting,
RNNs and Deep Neural Networks (DNNs) are highly vulnerable, especially over many
epochs. So, during every training step, a fraction of the outputs of a given layer (set by
the dropout constant) is chosen at random and set to zero. This temporarily removes those
values from the calculation, and thus prevents the NN from over-relying on them.
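A common form of dropout can be sketched as follows, assuming NumPy. In this variant (often called inverted dropout) the surviving values are scaled up so that the expected total stays the same; the rate of 0.2 is an arbitrary example and not necessarily the dropout constant used in the experiment.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Zero a random fraction `rate` of the values and rescale the rest."""
    if not training or rate == 0.0:
        return activations                      # dropout is only active during training
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
layer_output = np.ones(10000)                   # made-up layer outputs
dropped = dropout(layer_output, rate=0.2, rng=rng)
print(np.mean(dropped == 0), dropped.mean())
```

Roughly one fifth of the values are zeroed on each call, and a different random subset is chosen every time.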

For the sake of a fair comparison, the optimizer and loss function for the RNN are
the same as those of the CNN (the Adam optimizer and the Sparse Categorical Cross
Entropy loss function).


3.5 The Independent Variable 
The only independent variable in this experiment that is shared by both neural networks
is the number of epochs. The independent variables specific to the CNN are the filter
size and the pooling layer grid size. The independent variable specific to the RNN is the
dropout constant.

15 The code for my RNN is a modification of the code from this tutorial:
https://pythonprogramming.net/recurrent-neural-network-deep-learning-python-tensorflow-keras/


3.6 The Dependent Variables 

There are two dependent variables in this experiment: training duration and test dataset
accuracy. Training duration is the amount of time the NN takes to train (calculated using
the duration per epoch), and test dataset accuracy is the percentage accuracy of the
model performing on the test dataset.
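The training duration calculation amounts to simple arithmetic, sketched below. The example figures (85 steps per epoch and about 145 ms per step over 24 epochs) are taken from the CNN training log in section 4.1; the function names are illustrative.

```python
def training_duration(epoch_seconds):
    """Total training time as the sum of the per-epoch durations."""
    return sum(epoch_seconds)

def estimate_from_step_time(steps_per_epoch, ms_per_step, epochs):
    """Estimate the total time in seconds from the average time per step."""
    return steps_per_epoch * ms_per_step * epochs / 1000.0

estimate = estimate_from_step_time(steps_per_epoch=85, ms_per_step=145, epochs=24)
print(estimate)
```

The estimate of roughly 296 seconds is consistent with the training period of around 298 seconds reported for the 24-epoch CNN trial.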


3.7 The Hypothesis 

I have never before coded recurrent or convolutional neural networks, and therefore am
not sure what to expect for the speed of each, in terms of the magnitude of time they will
take to train (whether they will train in seconds, minutes, or hours). I am confident,
however, that the RNN will train faster, because the CNN has more weights to adjust,
due to the nature of its filters. In terms of accuracy, however, I believe that the two will
be close, with the RNN slightly beating the CNN. I believe this because memory is
important for audio classification, since audio is sequential data. Therefore, the absence
of memory is a disadvantage to the CNN in my eyes.


4. Experiment Result Analysis and Conclusion 


4.1 CNN Results

Following 10 epochs, the CNN yielded an accuracy of 94.67% over a training
period of around 125 seconds. Following 16 epochs, the CNN yielded an accuracy of
96.67% over a training period of around 201 seconds. Following 24 epochs, the CNN
yielded an accuracy of 96% over a training period of around 298 seconds.


Training log of the 24-epoch CNN trial (85 steps per epoch):

Epoch    Time   Time per step   Loss     Accuracy
1/24     13s    146ms           2.0011   0.2567
2/24     13s    155ms           0.9295   0.6730
3/24     13s    151ms           0.5796   0.8111
4/24     12s    143ms           0.3424   0.8393
5/24     12s    145ms           0.2502   0.9152
6/24     13s    148ms           0.1847   0.9437
7/24     13s    147ms           0.1495   0.9439
8/24     12s    144ms           0.1385   0.9515
9/24     12s    144ms           0.0985   0.9674
10/24    12s    144ms           0.0689   0.9778
11/24    12s    144ms           0.0712   0.9737
12/24    12s    143ms           0.0721   0.9748
13/24    12s    143ms           0.0592   0.9315
14/24    12s    143ms           0.0463   0.9341
15/24    12s    144ms           0.0423   0.9867
16/24    12s    143ms           0.0536   0.9815
17/24    12s    146ms           0.0365   0.9370
18/24    12s    145ms           0.0245   0.9911
19/24    12s    144ms           0.0239   0.9926
20/24    12s    146ms           0.0165   0.9948
21/24    13s    150ms           0.0096   0.9967
22/24    12s    146ms           0.0314   0.9881
23/24    12s    144ms           0.0576   0.9864
24/24    12s    145ms           0.0175   0.9948

Test evaluation: 10/10 - 1s - loss: 0.1588 - accuracy: 0.9600
0.9599999785423279

4.2 RNN Results

Following 10 epochs, the RNN yielded an accuracy of 29% over a training period
of around 71 seconds. Following 16 epochs, the RNN yielded an accuracy of 87% over
a training period of around 110 seconds. Following 24 epochs, the RNN yielded an
accuracy of 90.67% over a training period of around 165 seconds.


Training log of the 24-epoch RNN trial (85 steps per epoch; the only per-epoch times
legible in the log are 9s, 7s, and 7s):

Epoch    Time per step   Loss     Accuracy
1/24     80ms            2.3265   0.1300
2/24     79ms            2.2486   0.1748
3/24     79ms            2.0432   0.2537
4/24     79ms            1.5871   0.4181
5/24     80ms            1.3630   0.5085
6/24     82ms            1.1781   0.5807
7/24     82ms            1.4441   0.6019
8/24     87ms            0.8484   0.7041
9/24     84ms            0.7543   0.7415
10/24    81ms            0.6766   0.7633
11/24    80ms            0.6452   0.7811
12/24    80ms            0.5533   0.8222
13/24    79ms            0.4820   0.8348
14/24    79ms            0.7497   0.7489
15/24    79ms            0.5311   0.8274
16/24    79ms            0.4321   0.8530
17/24    80ms            0.4428   0.8493
18/24    80ms            0.4451   0.8600
19/24    81ms            0.3537   0.8830
20/24    79ms            0.3475   0.8844
21/24    80ms            0.3149   0.8944
22/24    79ms            0.3262   0.8930
23/24    82ms            0.3207   0.8944
24/24    80ms            0.2828   0.9033

Test evaluation: 10/10 - 1s - loss: 0.2440 - accuracy: 0.9067
0.9066666960716248


4.3 Comparison and Analysis 


28 95.33 


My CNN’s Performance Table and Graph (epoch vs accuracy) made using the Desmos 


Graphing Calculator, with x, representing epochs and y, the test dataset accuracy 


100 


My RNN’s Performance Table and Graph (epoch vs accuracy) made using the Desmos 


Graphing Calculator, with x, representing epochs and y, the test dataset accuracy 
The accuracy of the CNN was significantly greater than that of the RNN for all
three trials. The accuracy of the CNN increased on trial two (16 epochs), but decreased
on trial three (24 epochs). This was likely because of slight overfitting to the training
dataset, since 24 epochs is relatively large for a dataset of 2700 samples. The accuracy
of the RNN was surprisingly low for trial one (10 epochs), but increased greatly from trial
one to trial two. This suggests that the RNN has a "warm up" period during which the
optimizer, guided by the loss function, slowly adjusts the weights in different directions,
followed by a set of epochs with much steadier progress. The accuracy of the RNN
continued to increase for trial three. This suggests that the dropout method worked
successfully to prevent overfitting to the small training dataset. I would hypothesize that
more epochs would continue to increase the RNN's accuracy.
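
The pattern described above, where accuracy peaks and then falls as the number of epochs grows, is what early stopping tries to handle automatically. The following is a minimal sketch of that idea in plain Python, not the procedure used in this paper: `best_epoch` and `patience` are illustrative names, and the accuracy values in the example are invented purely to show the behaviour.

```python
def best_epoch(accuracies, patience=2):
    """Return (epoch, accuracy) for the best accuracy seen before stopping.

    `accuracies` holds one held-out accuracy per epoch (epoch 1 first).
    Scanning stops once `patience` consecutive epochs fail to improve.
    """
    best_acc = float("-inf")
    best_ep = 0
    waited = 0
    for epoch, acc in enumerate(accuracies, start=1):
        if acc > best_acc:
            # New best: remember it and reset the patience counter
            best_acc, best_ep, waited = acc, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_ep, best_acc


# Hypothetical per-epoch accuracies, invented for illustration only
example = [0.80, 0.90, 0.95, 0.94, 0.93, 0.96]
print(best_epoch(example, patience=2))  # (3, 0.95)
```

In practice the accuracies would have to come from a validation split held out of the training data rather than from the test set. Keras provides a comparable mechanism through its `EarlyStopping` callback.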

The training duration of the RNN was significantly shorter than that of the CNN
for all three trials. This was likely because the RNN's weights are simple, classical
NN weights applied once per step of the input sequence, whereas each of the CNN's
3x3 filters has to be applied at every position of its input, for 64 filters in each of
the four convolutional layers.
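
One rough way to sanity-check the difference in training cost is to count the multiply-accumulate operations (MACs) in a single forward pass through each network. The sketch below does this for the layer sizes given in the appendix listings: four 3x3 convolutions with 64 filters on a 64x64x1 input with 2x2 max pooling after the first three, against two 128-unit LSTM layers over 64 steps of 64 features. The function names are illustrative, ten output classes are assumed for both networks, and biases and activations are ignored. MAC counts are only a proxy for wall-clock time, so this is an estimate and not a measurement.

```python
def cnn_macs(input_size=64, filters=64, conv_layers=4, kernel=3,
             dense_units=64, classes=10):
    """Approximate MACs for one forward pass of the CNN."""
    total = 0
    size, channels = input_size, 1
    for layer in range(conv_layers):
        size = size - kernel + 1  # "valid" convolution shrinks the image
        total += size * size * kernel * kernel * channels * filters
        channels = filters
        if layer < conv_layers - 1:
            size //= 2  # 2x2 max pooling follows all but the last convolution
    flattened = size * size * channels
    total += flattened * dense_units + dense_units * classes
    return total


def lstm_macs(timesteps, input_dim, units):
    """Approximate MACs for one LSTM layer: four gates per time step."""
    return timesteps * 4 * units * (input_dim + units)


def rnn_macs(timesteps=64, features=64, units=128, dense_units=32, classes=10):
    """Approximate MACs for one forward pass of the two-layer LSTM network."""
    total = lstm_macs(timesteps, features, units)
    total += lstm_macs(timesteps, units, units)
    total += units * dense_units + dense_units * classes
    return total


print(cnn_macs())  # 39181184
print(rnn_macs())  # 14684480
```

Under these assumptions the CNN needs roughly 2.7 times as many multiply-accumulates per sample as the RNN, most of them in the second convolutional layer, which points in the same direction as the observed difference in training time.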

Finally, an important factor was the amount of troubleshooting each NN took to
work successfully. I was able to code the CNN and achieve the above accuracy in under 8
hours. In comparison, the RNN took over 20 hours of work to code.


4.4 Conclusion 
The final drop in accuracy of the CNN due to overfitting leads me to conclude
that, keeping the size of the dataset constant, there is a limit to the CNN's accuracy (in
this case, roughly 97%), achieved at the right number of epochs. Fewer epochs will not
be enough to reach this maximum, and more will lead to overfitting. This is not the case
for the RNN. Because the accuracy of the RNN continued to rise with an increase in
epochs, I believe that the RNN will be able to come closer to 100% accuracy, but only
over a very large number of epochs. However, because the CNN was more accurate in each
trial, I conclude that it is the more accurate network for speech recognition applications.
Nevertheless, the difference in training duration shows that the CNN was slower than
the RNN when training on the same dataset. Thus, the RNN is the faster approach to
speech recognition: its shorter training time makes it easier to adjust variables such as
the layer size, the optimizer algorithm, and the loss function, and to see the
corresponding effects on training and testing accuracy.
Appendices


Python Code for the CNN 


#!/usr/bin/env python
# coding: utf-8

# In[1]:

import numpy as np
import os
import sys
from random import shuffle
from tqdm import tqdm
import cv2
import matplotlib.pyplot as plt
import PIL.Image  # "import PIL" alone does not load the Image submodule

TRAIN_DIR = 'C:/Users/yigit/Desktop/free-spoken-digit-dataset-master/training-spectrograms'
TEST_DIR = 'C:/Users/yigit/Desktop/free-spoken-digit-dataset-master/testing-spectrograms'

img_count = 2700                  # number of training spectrograms
trainsize = 64 * 64 * img_count   # total number of training pixels
img_test = 300                    # number of testing spectrograms
testsize = 64 * 64 * img_test     # total number of testing pixels

# Preallocated arrays; every entry is overwritten below
x_train = np.arange(trainsize)
x_labels = np.arange(img_count)
y_test = np.arange(testsize)
y_labels = np.arange(img_test)

# In[2]:

def label_img(img):
    # The digit label is the first part of the file name
    # (separator assumed to be '_', as in the dataset's file names)
    word_label = img.split('_')[0]
    label = int(word_label)
    return label

# In[3]:

def create_train_data():
    training_data = []
    training_labels = []
    for img in tqdm(os.listdir(TRAIN_DIR)):
        label = label_img(img)
        path = os.path.join(TRAIN_DIR, img)
        image = PIL.Image.open(path)
        sequence = image.getdata()        # flat sequence of pixel tuples
        image_array = np.array(sequence)
        training_data.append(image_array)
        training_labels.append(label)
    return training_data, training_labels

# In[4]:

training, labels = create_train_data()

# In[5]:

def create_test_data():
    testing_data = []
    testing_labels = []
    for img in tqdm(os.listdir(TEST_DIR)):
        label = label_img(img)
        path = os.path.join(TEST_DIR, img)   # was TRAIN_DIR in the original
        image = PIL.Image.open(path)
        sequence = image.getdata()
        image_array = np.array(sequence)
        testing_data.append(image_array)
        testing_labels.append(label)
    return testing_data, testing_labels

# In[6]:

testing, tags = create_test_data()

# In[7]:

# Copy the first channel of every pixel into the flat training array
traincount = 0
for i in range(len(training)):
    for j in range(len(training[i])):
        value = training[i][j][0]
        x_train[traincount] = value
        traincount = traincount + 1

# In[8]:

x_train = x_train.reshape(img_count, 64, 64, 1)

# In[9]:

for i in range(img_count):
    x_labels[i] = labels[i]

# In[10]:

x_labels = x_labels.reshape(img_count, 1)

# In[11]:

# Copy the first channel of every pixel into the flat testing array
testcount = 0
for i in range(len(testing)):
    for j in range(len(testing[i])):
        value = testing[i][j][0]
        y_test[testcount] = value
        testcount = testcount + 1

# In[12]:

y_test = y_test.reshape(img_test, 64, 64, 1)

# In[13]:

for i in range(img_test):
    y_labels[i] = tags[i]

# In[14]:

y_labels = y_labels.reshape(img_test, 1)

# In[15]:

# Scale pixel values from 0-255 to 0-1
x_train, y_test = x_train / 255.0, y_test / 255.0

# In[16]:

class_names = ['0', '1', '2', '3', '4',
               '5', '6', '7', '8', '9']

# In[17]:

import tensorflow as tf

from tensorflow.keras import datasets, layers, models

# In[18]:

model = models.Sequential()
model.add(layers.Conv2D(64, (3, 3), activation='relu', input_shape=(64, 64, 1)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))

# In[20]:

model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10))   # one raw score (logit) per digit

# In[22]:

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

# In[24]:

model.fit(x_train, x_labels, epochs=24)

# In[25]:

test_loss, test_acc = model.evaluate(y_test, y_labels, verbose=2)
print(test_acc)
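
The nested loops in the listing above copy one pixel at a time into a preallocated array. A vectorized alternative is sketched here, under the assumption that every image yields a sequence of 64 x 64 channel tuples of equal length (as `image.getdata()` does for RGB or RGBA images). `to_model_input` is a hypothetical helper name and is not part of the original code.

```python
import numpy as np


def to_model_input(images, size=64):
    """Stack per-image pixel sequences into a float array of shape
    (N, size, size, 1) with values scaled to the range 0 to 1.

    Each element of `images` is assumed to have shape (size * size, channels).
    """
    stacked = np.asarray(images, dtype=np.float32)  # (N, size*size, channels)
    first_channel = stacked[:, :, 0]                # keep one channel per pixel
    return first_channel.reshape(len(images), size, size, 1) / 255.0


# Tiny synthetic check with two 4x4 "images" (all white, all black)
demo = [np.full((16, 3), 255), np.zeros((16, 3))]
print(to_model_input(demo, size=4).shape)  # (2, 4, 4, 1)
```

With the listing's variables, a call such as `to_model_input(training)` would stand in for the pixel-copying loop, the reshape, and the division by 255.0.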


Terminal Output for the CNN 


100%|██████████| 2700/2700 [00:19<00:00, 141.09it/s]
2021-08-10 11:13:01.049707: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2021-08-10 11:13:01.049882: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-08-10 11:13:05.075755: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'nvcuda.dll'; dlerror: nvcuda.dll not found
2021-08-10 11:13:05.075906: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2021-08-10 11:13:05.081306: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: DESKTOP-GJ6KHSP
2021-08-10 11:13:05.081560: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: DESKTOP-GJ6KHSP
2021-08-10 11:13:05.081974: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-08-10 11:13:07.956747: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)

Epoch 1/24
loss: 1.9459 - accuracy: 0.2815
Epoch 2/24
loss: 0.9275 - accuracy: 0.6937
Epoch 3/24
loss: 0.5315 - accuracy: 0.8248
Epoch 4/24
loss: 0.3486 - accuracy: 0.8822
Epoch 5/24
loss: 0.2183 - accuracy: 0.9289
Epoch 6/24
loss: 0.1653 - accuracy: 0.9452
Epoch 7/24
loss: 0.1254 - accuracy: 0.9578
Epoch 8/24
loss: 0.1079 - accuracy: 0.9648
Epoch 9/24
loss: 0.0919 - accuracy: 0.9670
Epoch 10/24
loss: 0.0848 - accuracy: 0.9704
Epoch 11/24
loss: 0.0658 - accuracy: 0.9789
Epoch 12/24
loss: 0.0560 - accuracy: 0.9819
Epoch 13/24
loss: 0.0595 - accuracy: 0.9822
Epoch 14/24
loss: 0.0404 - accuracy: 0.9874
Epoch 15/24
loss: 0.0297 - accuracy: 0.9904
Epoch 16/24
loss: 0.0410 - accuracy: 0.9844
Epoch 17/24
loss: 0.0642 - accuracy: 0.9807
Epoch 18/24
loss: 0.0344 - accuracy: 0.9881
Epoch 19/24
loss: 0.0326 - accuracy: 0.9889
Epoch 20/24
loss: 0.0344 - accuracy: 0.9881
Epoch 21/24
loss: 0.0370 - accuracy: 0.9889
Epoch 22/24
accuracy: 0.9956
Epoch 23/24
accuracy: 0.9981
Epoch 24/24
accuracy: 0.9922
10/10 - 1s - loss: 0.1988 - accuracy: 0.9667

0.9666666388511658
Python Code for the RNN


#!/usr/bin/env python
# coding: utf-8

# In[1]:

import numpy as np
import os
import sys
from random import shuffle
from tqdm import tqdm
import cv2
import matplotlib.pyplot as plt
import PIL.Image  # "import PIL" alone does not load the Image submodule

TRAIN_DIR = 'C:/Users/yigit/Desktop/free-spoken-digit-dataset-master/training-spectrograms'
TEST_DIR = 'C:/Users/yigit/Desktop/free-spoken-digit-dataset-master/testing-spectrograms'

img_count = 2700                  # number of training spectrograms
trainsize = 64 * 64 * img_count   # total number of training pixels
img_test = 300                    # number of testing spectrograms
testsize = 64 * 64 * img_test     # total number of testing pixels

# Preallocated arrays; every entry is overwritten below
x_train = np.arange(trainsize)
x_labels = np.arange(img_count)
y_test = np.arange(testsize)
y_labels = np.arange(img_test)

# In[2]:

def label_img(img):
    # The digit label is the first part of the file name
    # (separator assumed to be '_', as in the dataset's file names)
    word_label = img.split('_')[0]
    label = int(word_label)
    return label

# In[3]:

def create_train_data():
    training_data = []
    training_labels = []
    for img in tqdm(os.listdir(TRAIN_DIR)):
        label = label_img(img)
        path = os.path.join(TRAIN_DIR, img)
        image = PIL.Image.open(path)
        sequence = image.getdata()        # flat sequence of pixel tuples
        image_array = np.array(sequence)
        training_data.append(image_array)
        training_labels.append(label)
    return training_data, training_labels

# In[4]:

training, labels = create_train_data()

# In[5]:

def create_test_data():
    testing_data = []
    testing_labels = []
    for img in tqdm(os.listdir(TEST_DIR)):
        label = label_img(img)
        path = os.path.join(TEST_DIR, img)   # was TRAIN_DIR in the original
        image = PIL.Image.open(path)
        sequence = image.getdata()
        image_array = np.array(sequence)
        testing_data.append(image_array)
        testing_labels.append(label)
    return testing_data, testing_labels

# In[6]:

testing, tags = create_test_data()

# In[7]:

# Copy the first channel of every pixel into the flat training array
traincount = 0
for i in range(len(training)):
    for j in range(len(training[i])):
        value = training[i][j][0]
        x_train[traincount] = value
        traincount = traincount + 1

# In[9]:

# Each spectrogram becomes a sequence of 64 steps with 64 features
x_train = x_train.reshape(img_count, 64, 64)

# In[10]:

for i in range(img_count):
    x_labels[i] = labels[i]

# In[11]:

x_labels = x_labels.reshape(img_count, 1)

# In[12]:

print(x_train.shape)
print(x_labels.shape)

# In[13]:

# Copy the first channel of every pixel into the flat testing array
testcount = 0
for i in range(len(testing)):
    for j in range(len(testing[i])):
        value = testing[i][j][0]
        y_test[testcount] = value
        testcount = testcount + 1

# In[15]:

y_test = y_test.reshape(img_test, 64, 64)

# In[16]:

for i in range(img_test):
    y_labels[i] = tags[i]

# In[17]:

y_labels = y_labels.reshape(img_test, 1)

# In[18]:

# Scale pixel values from 0-255 to 0-1
x_train, y_test = x_train / 255.0, y_test / 255.0

# In[19]:

class_names = ['0', '1', '2', '3', '4',
               '5', '6', '7', '8', '9']

# In[20]:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import models, layers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM

# In[25]:

model = keras.Sequential()

model.add(layers.LSTM(128, input_shape=(x_train.shape[1:]), activation='relu',
                      return_sequences=True))
model.add(Dropout(0.2))

model.add(layers.LSTM(128, activation='relu'))
model.add(Dropout(0.2))

model.add(layers.Dense(32, activation='relu'))
model.add(Dropout(0.2))
model.add(layers.Dense(10, activation='softmax'))   # one probability per digit

# In[26]:

opt = tf.keras.optimizers.Adam(learning_rate=0.001, decay=1e-5)

model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=opt,
    metrics=['accuracy'],
)

# In[28]:

model.fit(x_train, x_labels, epochs=24)

# In[29]:

test_loss, test_acc = model.evaluate(y_test, y_labels, verbose=2)
print(test_acc)


Terminal Output for the RNN 


100%|██████████| 2700/2700 [00:18<00:00, 146.43it/s]
(2700, 64, 64)
(2700, 1)
2021-08-10 11:28:55.410371: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2021-08-10 11:28:55.410536: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-08-10 11:28:58.510744: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'nvcuda.dll'; dlerror: nvcuda.dll not found
2021-08-10 11:28:58.511072: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2021-08-10 11:28:58.518412: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: DESKTOP-GJ6KHSP
2021-08-10 11:28:58.518682: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: DESKTOP-GJ6KHSP
2021-08-10 11:28:58.519533: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-08-10 11:28:59.272410: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)

Epoch 1/24
accuracy: 0.1315
Epoch 2/24
85/85 [==============================] - 8s 89ms/step - loss: 2.7080 - accuracy: 0.1807
Epoch 3/24
85/85 [==============================] - 7s 81ms/step - loss: 2.4055 - accuracy: 0.1837
Epoch 4/24
85/85 [==============================] - 8s 91ms/step - loss: 2.2728 - accuracy: 0.1415
Epoch 5/24
85/85 [==============================] - 7s 85ms/step - loss: 2.0897 - accuracy: 0.2237
Epoch 6/24
85/85 [==============================] - 7s 83ms/step - loss: 1.8103 - accuracy: 0.3148
Epoch 7/24
accuracy: 0.3733
Epoch 8/24
accuracy: 0.1311
Epoch 9/24
accuracy: 0.1663
Epoch 10/24
accuracy: 0.2093
Epoch 11/24
accuracy: 0.2493
Epoch 12/24
accuracy: 0.3026
Epoch 13/24
accuracy: 0.3804
Epoch 14/24
accuracy: 0.3015
[Loss values surviving from this part of the output, in printed order: 3969.2903, 2.5946, 2.2648, 2.0859, 1.9029, 1.7761, 1.6150]
Epoch 15/24
accuracy: 0.4004
Epoch 16/24
accuracy: 0.4630
Epoch 17/24
accuracy: 0.4993
Epoch 18/24
accuracy: 0.5585
Epoch 19/24
accuracy: 0.5948
Epoch 20/24
accuracy: 0.6244
Epoch 21/24
accuracy: 0.6659
[Loss values surviving from this part of the output, in printed order: 1.3970, 1.3381, 1.2130, 1.0974, 1.0235, 0.9381]
Epoch 22/24
accuracy: 0.6941
Epoch 23/24
accuracy: 0.6900
Epoch 24/24
accuracy: 0.7074
10/10 - 1s - loss: 0.5634 - accuracy: 0.8400

0.8399999737739563